
chore: Make graph visualization more scalable #208

Open
ericfe-google wants to merge 3 commits into main from query3

Conversation

@ericfe-google
Contributor

We now include query results directly in the HTML only when they are smaller than 100 KB. For larger query results, we store only a reference to the destination table in the HTML and have the Python code re-read the query results from the destination table during the callback.

Also added a hard limit of 5 MB on the query result size, beyond which graph visualization is not supported at all.
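
A minimal sketch of the resulting three-tier branching, assuming a hypothetical helper name and payload keys (only the constants and _estimate_json_size are from this PR; the real logic lives in _add_graph_widget):

MAX_GRAPH_VISUALIZATION_SIZE = 5_000_000               # 5 MB hard limit
MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE = 100_000    # 100 KB embed limit

def _choose_payload(df, destination_table):
    # Hypothetical helper illustrating the branching, not the PR's code.
    estimated = _estimate_json_size(df)
    if estimated > MAX_GRAPH_VISUALIZATION_SIZE:
        # Too large: graph visualization is not supported at all.
        raise ValueError("Query result too large to visualize")
    if estimated > MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE:
        # Medium: embed only a table reference; the callback re-reads
        # the rows from the destination table on demand.
        return {"destination": str(destination_table)}
    # Small: embed the serialized rows directly in the HTML.
    return {"query_result": df.to_json(orient="records")}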

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added the labels size: l (Pull request size is large) and api: bigquery (Issues related to the googleapis/python-bigquery-magics API) on Feb 2, 2026
@gemini-code-assist

Summary of Changes

Hello @ericfe-google, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the scalability of the graph visualization feature by introducing intelligent data handling based on query result size. It prevents embedding excessively large datasets directly into the HTML, opting instead for a callback mechanism to fetch data from BigQuery for medium-sized results, and imposing a hard limit for extremely large ones. This change improves performance and resource usage, making the visualization more robust for diverse data volumes.

Highlights

  • Scalable Graph Visualization: Implemented a mechanism to handle large query results for graph visualization more efficiently, preventing direct embedding of excessively large datasets into HTML.
  • Size-Based Data Handling: Query results under 100 KB are embedded directly in the HTML. For results between 100 KB and 5 MB, only a reference to the BigQuery destination table is passed, and the data is re-read via a callback when the visualization is requested.
  • Hard Limit for Visualization: Introduced a 5 MB hard limit on query result size, beyond which graph visualization is not supported, displaying an error message instead.
  • Refactored Client Creation and User Agent Logic: Moved BigQuery client creation and user agent generation logic from bigquery.py to graph_server.py for improved modularity and reuse.
  • JSON Size Estimation: Added a new utility function _estimate_json_size to approximate the JSON size of a Pandas DataFrame without fully serializing it, enabling efficient size checks.
Changelog
  • bigquery_magics/bigquery.py
    • Removed client_info, environment, version imports, and _get_user_agent function, moving them to graph_server.py.
    • Refactored BigQuery client creation into graph_server.create_bq_client.
    • Introduced MAX_GRAPH_VISUALIZATION_SIZE (5MB) and MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE (100KB) constants.
    • Added _estimate_json_size function for approximating DataFrame JSON size.
    • Modified _add_graph_widget to conditionally embed query results or pass destination table references based on estimated size.
    • Updated _colab_query_callback to expect query_result within params.
    • Updated _make_bq_query and _make_bqstorage_client to use the new _add_graph_widget signature and graph_server._get_user_agent.
  • bigquery_magics/graph_server.py
    • Added imports for copy, IPython, client_info, bigquery, environment, config, version.
    • Moved _get_user_agent and create_bq_client functions from bigquery.py.
    • Modified handle_post_query to dynamically fetch query results from BigQuery using destination table references when they are not directly embedded in the initial request (see the sketch after this changelog).
  • tests/unit/bigquery/test_bigquery.py
    • Updated various test setups to include _project in bigquery_magics.context.
    • Enhanced mocks for QueryJob to include configuration.destination attributes.
    • Added assertions for args serialization in HTML content.
    • Introduced new tests: test_bigquery_graph_size_exceeds_max, test_bigquery_graph_size_exceeds_query_result_max, and test_bigquery_graph_with_args_serialization.
    • Adjusted test_colab_query_callback to match the updated _colab_query_callback parameter structure.
  • tests/unit/test_graph_server.py
    • Updated test_post_query to reflect the new query_result key in the params data structure.
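
To make the callback path concrete, here is a minimal sketch of how the server side might branch (the query_result key and create_bq_client are from this PR; the other params keys and the rendering helper are assumptions):

def handle_post_query(params):
    # Small results arrive already embedded in the request.
    if "query_result" in params:
        rows_json = params["query_result"]
    else:
        # Larger results carry only a table reference: re-read the rows
        # from the destination table with a fresh BigQuery client.
        client = create_bq_client(
            params["project"], params["bigquery_api_endpoint"], params["location"]
        )
        rows = client.list_rows(params["destination"]).to_dataframe()
        rows_json = rows.to_json(orient="records")
    return _build_graph_response(rows_json)  # hypothetical renderer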
Activity
  • The author ericfe-google created this pull request.
  • The pull request introduces significant changes to how graph visualization data is handled, focusing on scalability.
  • Code refactoring involved moving client creation and user agent logic to a dedicated graph_server.py module.
  • New logic was implemented in bigquery.py to estimate JSON size and conditionally embed data or pass table references.
  • The graph_server.py was updated to handle callbacks for re-reading data from BigQuery for larger results.
  • Comprehensive unit tests were added and updated to cover the new size limits and data serialization mechanisms.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a more scalable approach for graph visualization by conditionally embedding query results in the HTML based on their size. For larger results, it fetches them on-demand from a temporary BigQuery table. The changes include refactoring client creation logic for better code organization and adding new size estimation functions. My review focuses on a bug in the size estimation logic and a maintainability improvement in the server-side handling of on-demand data fetching. The added tests are comprehensive and cover the new functionality well.

@ericfe-google ericfe-google marked this pull request as ready for review February 2, 2026 22:32
@ericfe-google ericfe-google requested review from a team as code owners February 2, 2026 22:32
@ericfe-google ericfe-google force-pushed the query3 branch 7 times, most recently from f7d3ba0 to 88ab993 on February 3, 2026 00:08
-    client_options=bigquery_client_options,
-    location=args.location,
+bq_client = core.create_bq_client(
+    args.project, args.bigquery_api_endpoint, args.location
Collaborator


Nit: since these are all string arguments (not that we've enabled type checking, anyway), passing by keyword could prevent accidentally passing the wrong value to the wrong parameter.

Suggested change
-    args.project, args.bigquery_api_endpoint, args.location
+    project=args.project,
+    bigquery_api_endpoint=args.bigquery_api_endpoint,
+    location=args.location,

return " ".join(identities)


def create_bq_client(project: str, bigquery_api_endpoint: str, location: str):
Collaborator


Even more optional: we could force these to be keyword arguments. That's a practice the Vertex team follows, which is helpful because it means we could theoretically reorder the arguments without breaking changes. I'm of mixed opinion about forcing that on users, but I do think it's a useful practice for internal functions like this.

Suggested change
-def create_bq_client(project: str, bigquery_api_endpoint: str, location: str):
+def create_bq_client(*, project: str, bigquery_api_endpoint: str, location: str):
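
With the * marker, a positional call fails fast, which is the point of the suggestion (illustrative calls; the endpoint value is a placeholder):

create_bq_client("my-project", "bigquery.googleapis.com", "US")   # raises TypeError
create_bq_client(
    project="my-project",
    bigquery_api_endpoint="bigquery.googleapis.com",
    location="US",
)  # OK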

Comment on lines +15 to +21
import copy
from google.api_core import client_info
from google.cloud import bigquery
import IPython # type: ignore
from bigquery_magics import environment
import bigquery_magics.config
import bigquery_magics.version
Collaborator


PEP-8:

Imports should be grouped in the following order:

  1. Standard library imports.
  2. Related third party imports.
  3. Local application/library specific imports.

You should put a blank line between each group of imports.

https://peps.python.org/pep-0008/#imports

In this case:

Suggested change
-import copy
-from google.api_core import client_info
-from google.cloud import bigquery
-import IPython  # type: ignore
-from bigquery_magics import environment
-import bigquery_magics.config
-import bigquery_magics.version
+import copy
+
+from google.api_core import client_info
+from google.cloud import bigquery
+import IPython  # type: ignore
+
+from bigquery_magics import environment
+import bigquery_magics.config
+import bigquery_magics.version

Comment on lines +632 to +633
MAX_GRAPH_VISUALIZATION_SIZE = 5000000
MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE = 100000
Collaborator


Nit: I find it helpful to group the 0s in Python to better understand the scale at a glance.
Also, "size" is pretty ambiguous. Please rename to include the units. For example _BYTES.

Suggested change
-MAX_GRAPH_VISUALIZATION_SIZE = 5000000
-MAX_GRAPH_VISUALIZATION_QUERY_RESULT_SIZE = 100000
+MAX_GRAPH_VISUALIZATION_BYTES = 5_000_000
+MAX_GRAPH_VISUALIZATION_QUERY_RESULT_BYTES = 100_000



def _estimate_json_size(df: pandas.DataFrame) -> int:
"""Approximates the length of df.to_json(orient='records')
Collaborator


I know it's not a perfect estimate, but pandas provides https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html

Could we use that as the starting point, instead? How accurate do you need to be?
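
For comparison, a minimal sketch of both measurements; DataFrame.memory_usage and to_json are real pandas APIs, but the sample data and the use of the memory footprint as a proxy are illustrative:

import pandas as pd

df = pd.DataFrame({"a": range(1_000), "b": ["x"] * 1_000})

# In-memory footprint, including the index and string objects: cheap to
# compute, but only a proxy for the serialized size.
mem_bytes = int(df.memory_usage(deep=True).sum())

# Exact JSON length: requires the full serialization that
# _estimate_json_size is trying to avoid.
json_bytes = len(df.to_json(orient="records"))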

@@ -0,0 +1,73 @@
# Copyright 2024 Google LLC
Collaborator


Suggested change
-# Copyright 2024 Google LLC
+# Copyright 2026 Google LLC

